Author Verification Using Common N-Gram Profiles of Text Documents
Abstract
Authorship verification is the problem of deciding whether a sample text document was written by a specific person, given a few other documents known to be authored by that person. We propose a proximity-based method for one-class classification that applies the Common N-Gram (CNG) dissimilarity measure. The CNG dissimilarity (Kešelj et al., 2003) is based on the differences in the frequencies of the n-grams of tokens (characters or words) that are most common in the considered documents. Our method utilizes the pairs of most dissimilar documents among the documents of known authorship. We evaluate several variants of the method, as a single classifier or as an ensemble of classifiers, on the multilingual authorship verification corpus of the PAN 2013 Author Identification evaluation framework. Our method yields competitive results when compared with those achieved by the participants of the PAN 2013 competition, both on the entire set and separately on two of the corpus's three language subsets, English and Spanish.
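As an illustration of the dissimilarity measure mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes character n-grams, a profile made of the most frequent n-grams with their normalized frequencies, and the squared relative-difference formula commonly given for the CNG dissimilarity of Kešelj et al. (2003). The function names and the default values of `n` and `profile_len` are hypothetical choices made for this example.

```python
from collections import Counter


def ngram_profile(text, n=3, profile_len=500):
    """Return the profile_len most frequent character n-grams of text,
    mapped to their normalized frequencies (count / total n-grams).
    The defaults for n and profile_len are illustrative assumptions."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(profile_len)}


def cng_dissimilarity(p1, p2):
    """Sum, over every n-gram in either profile, the squared difference
    of its two frequencies divided by their mean. An n-gram missing
    from a profile is treated as having frequency 0 there."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2.0)) ** 2
    return total
```

Under this formula, two identical profiles have dissimilarity 0, and every n-gram that appears in only one of the two profiles contributes exactly 4 to the sum.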
Related resources
Reclaiming Individuality of Mysterious Passage
Abstract: Authorship attribution, the science of inferring characteristics of an author from characteristics of documents written by that author, has become an urgent need for finding the original author of an anonymous text. In this paper, a novel approach is proposed that attempts to measure the style variation of an author using character n-gram profiles. This proposed method is a different approach to identi...
Discrepancies Detection in Arabic and English Documents
This paper analyzes and compares the results of methods for discrepancy detection based on character n-gram profiles (the set of normalized character n-gram frequencies of a text) for English and Arabic documents. English and Arabic texts were analyzed with respect to many statistical characteristics. We covered some statistical differences between both languages and we app...
Proximity Based One-class Classification with Common N-Gram Dissimilarity for Authorship Verification Task Notebook for PAN at CLEF 2013
We describe our participation in the Author Identification task of the PAN 2013 competition. This competition task presents participants with a set of authorship verification problems. In each such problem, one is given a set of documents written by one author and a sample document; the task is to answer the question whether or not the sample document was written by the same author as the rem...
Intrinsic Plagiarism Detection Using Character n-gram Profiles
The task of intrinsic plagiarism detection deals with cases where no reference corpus is available and it is exclusively based on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity mea...
Ensembles of Proximity-Based One-Class Classifiers for Author Verification Notebook for PAN at CLEF 2014
We use ensembles of proximity-based one-class classifiers for the authorship verification task. The one-class classifiers compare, for each document of known authorship, the dissimilarity between this document and the most dissimilar other document of this authorship to the dissimilarity between this document and the questioned document. As the dissimilarity measure between documents we use Com...
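The comparison described in this snippet can be sketched as follows. This is a hedged illustration, not the published classifier. The snippet only says that the two dissimilarities are compared. Expressing the comparison as a ratio, averaging the ratios over the known documents, and applying a cut-off `threshold` are assumptions made for this example, as are the function names. The dissimilarity measure is passed in as a callable `dissim` so that any measure can be plugged in.

```python
def proximity_ratio_score(known_docs, questioned, dissim):
    """For each known document, divide its dissimilarity to the
    questioned document by its dissimilarity to the most dissimilar
    other known document, then return the mean of these ratios.
    Averaging the ratios is an assumption of this sketch."""
    if len(known_docs) < 2:
        raise ValueError("need at least two documents of known authorship")
    ratios = []
    for i, doc in enumerate(known_docs):
        # Largest dissimilarity between doc and any other known document.
        d_max = max(
            dissim(doc, other)
            for j, other in enumerate(known_docs)
            if j != i
        )
        if d_max == 0:
            raise ValueError("known documents are indistinguishable")
        ratios.append(dissim(doc, questioned) / d_max)
    return sum(ratios) / len(ratios)


def same_author(known_docs, questioned, dissim, threshold=1.0):
    """Accept the questioned document when the mean ratio does not
    exceed threshold. The default of 1.0 is a hypothetical value,
    not one taken from the source."""
    return proximity_ratio_score(known_docs, questioned, dissim) <= threshold
```

A low score means the questioned document is no farther from the known documents than they are from each other. In practice the threshold would have to be tuned on training data.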